Talk:UTF-8/Archive 2

From Wikipedia, the free encyclopedia
Archive 1 | Archive 2 | Archive 3 | Archive 4 | Archive 5

XML preference for UTF-8

I have re-read the XML specification section on encoding [1], and I cannot find anything in it that supports the idea that UTF-8 is preferred over UTF-16. This was a long discussion, with W3C explicitly deciding to be neutral between those two (which I personally think is a mistake). I think it's correct to say that UTF-8 is the most used variant on the Web, but I'm neither aware of a relevant published statistic nor aware of what the situation is in non-Web situations. --Alvestrand (talk) 05:56, 19 May 2009 (UTC)

Except the deleted paragraph pretty clearly says "UTF-8 *AND* UTF-16". UTF-16 is not the only competitor to UTF-8 you know. —Preceding unsigned comment added by Spitzak (talkcontribs) 04:12, 21 May 2009 (UTC)
You apparently failed to read to the end of the sentence that you reinserted. That sentence clearly stated that UTF-8 is the preferred encoding for XML (and HTML), despite the fact that nothing in the XML specification gives any preference to UTF-8 over UTF-16. I have removed all the unreferenced parts of the sentence, leaving only the statement that UTF-8 and UTF-16 are the standard encodings for XML. --Zundark (talk) 07:54, 21 May 2009 (UTC)
Can somebody search the history for the original paragraph? I obviously screwed it up by trying to merge the XML and HTML paragraphs and this meant UTF-16 had to be mentioned, and did not look at what the ref pointed to. This has veered off into completely irrelevant statements. The desired implication is "your web pages are almost certainly in UTF-8" without being technically inaccurate. —Preceding unsigned comment added by 64.183.15.38 (talk) 19:39, 22 May 2009 (UTC)
The statement that "your web page is almost certainly UTF-8" seems verifiable. I found the following statement from the World Wide Web Consortium: [2] - the graph (from Google) seems to indicate ~25% ASCII, ~25% UTF-8 and ~25% 8859-*, with UTF-16 simply not being on the map. --Alvestrand (talk) 20:43, 22 May 2009 (UTC)

Use of the term byte in this article

The definition of a byte is a character - NOT a tuple of 8 bits. That is an octet. This article should be rewritten using the correct usage of octet, when a group of 8 bits is meant, and byte when a character (e.g. UTF-8 character) is meant. 93.219.160.251 (talk) 14:09, 27 June 2009 (UTC)Martin

In the ancient times, there were machines with 6, 7, 8 and 9 bits to the byte. These are now only found in the computer history museums. In practice, byte is no longer ambiguous. But feel free... octet IS more precise. --Alvestrand (talk) 14:20, 27 June 2009 (UTC)
A byte is the unit of memory addressing, typically 8 bits. A byte is most definitely NOT defined as a character. Google/Bing on "multi-byte character" for evidence. Leotohill (talk) 17:00, 27 June 2009 (UTC)

Octet is more of a French word than it is an English word. Since "byte" is used in almost all contexts by actual English-speaking computer professionals, it should be preferred in this article, unless it creates ambiguity (which I don't see). AnonMoos (talk) 20:21, 27 June 2009 (UTC)

Agreed, "octet" is not normally used in English, and in modern usage "byte" is unambiguously 8 bits. The Glossary of Unicode terms states:
Byte. (1) The minimal unit of addressable storage for a particular computer architecture. (2) An octet. Note that many early computer architectures used bytes larger than 8 bits in size, but the industry has now standardized almost uniformly on 8-bit bytes. The Unicode Standard follows the current industry practice in equating the term byte with octet and using the more familiar term byte in all contexts.
It would be confusing to change "byte" to "octet", and definitely wrong to call UTF-8 characters "bytes". BabelStone (talk) 22:21, 27 June 2009 (UTC)

Archiving

Since this talk page has passed 100 Kbytes, I'm setting up archiving for it. --Alvestrand (talk) 14:23, 27 June 2009 (UTC)

slightly ambiguous text

"UTF-16 requires surrogates; an offset of 0x10000 is subtracted, so the bit pattern is not identical with UTF-8" - does this mean that the offset is subtracted in UTF-8, or does it mean that it is subtracted in UTF-16 and not UTF-8? —The preceding unsigned comment was added by Plugwash (talkcontribs) 2004-12-04 22:38:47 (UTC).

Normal date format: 22:38, 04 December 2004
The offset of 0x10000 is part of how UTF-16 encodes a character greater than U+FFFF. If you are translating UTF-16 to UTF-8, you would recreate the Unicode code point by combining the bits of the two UTF-16 surrogate halves and then adding 0x10000. You would then split the resulting number into the 4 bytes of UTF-8. Spitzak (talk) 23:31, 7 July 2009 (UTC)
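(As an illustrative aside: a minimal sketch in C of the arithmetic just described, assuming the input really is a valid surrogate pair; the function name is made up for this example.)

#include <stdint.h>

/* Combine a UTF-16 surrogate pair into a code point, then emit its 4-byte UTF-8 form. */
static void surrogate_pair_to_utf8(uint16_t hi, uint16_t lo, unsigned char out[4])
{
    /* hi is in 0xD800..0xDBFF, lo is in 0xDC00..0xDFFF */
    uint32_t cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    out[0] = 0xF0 | (cp >> 18);           /* 11110xxx */
    out[1] = 0x80 | ((cp >> 12) & 0x3F);  /* 10xxxxxx */
    out[2] = 0x80 | ((cp >> 6) & 0x3F);   /* 10xxxxxx */
    out[3] = 0x80 | (cp & 0x3F);          /* 10xxxxxx */
}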

characters in 2 bytes

How was the choice which characters will fit in two bytes made? Simply by Unicode ordering of them? Wouldn't that be obviously seen as suboptimal? It seems strange not even a single ISCII-based layout could fit there (e.g. defaulting to Devanagari), but there was room for (if Unicode order is the case) practically every conceivable precomposed Latin character, even though they are used sparingly in some languages using Latin script, as well as Coptic, Syriac, Armenian, Tāna (the latter apparently used by the 300,000 inhabitants of the Maldives). In each case the language is penalised already by a 100% increase in size from 1 byte to 2 bytes, yet that's at least not worse than using UTF-16, or is better due to the 1-byte space and such - but an avoidable 200% only for such widely used ones? I presume it also includes N'Ko? 78.0.213.113 (talk) 14:33, 3 July 2009 (UTC)

UTF-8 is not a separate character set from Unicode; it is just a very mathematically-simple method of expressing Unicode in a kind of byte-serialized form. The "choice of characters which fit into two bytes" was made on the basis of having five available bits in the first byte and six available bits in the second byte, so that the largest Unicode character which can be encoded is (2^11)-1, or 2047. Anything else would complicate the conversion from UTF-8 to other formats... AnonMoos (talk) 20:28, 3 July 2009 (UTC)
P.S. The Unicode script blocks were basically ordered to put European scripts first, then Right-to-Left scripts in a block after them, and then everything else. But those decisions were made before UTF-8 was invented, and long before UTF-8 became commonly used... AnonMoos (talk) 20:33, 3 July 2009 (UTC)
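(To make the two-byte arithmetic described above concrete, here is a minimal sketch in C; it assumes the code point is already known to be in the U+0080..U+07FF range.)

/* Encode a code point in U+0080..U+07FF as two UTF-8 bytes:
   five data bits go in the lead byte, six in the continuation byte. */
static void utf8_encode_2byte(unsigned int cp, unsigned char out[2])
{
    out[0] = 0xC0 | (cp >> 6);    /* 110xxxxx */
    out[1] = 0x80 | (cp & 0x3F);  /* 10xxxxxx */
}

For example U+0416 (Cyrillic Ж, binary 100 0001 0110) becomes the bytes D0 96.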

description of the algorithm?

Shouldn't there be a description of how the algorithm for generating and reading UTF-8 stuff works? (explaining how the bytes represent codepoints, the surrogate thing etc) --TiagoTiago (talk) 05:13, 23 July 2009 (UTC)

There is a table right at the start of the article! Surrogates are part of UTF-16. I think if you are looking for C code that should be elsewhere than Wikipedia. Spitzak (talk) 08:39, 24 July 2009 (UTC)

Disadvantage not about UTF-8 in specific?


Disadvantages
  • A badly-written (and not compliant with current versions of the standard) UTF-8 parser could accept a number of different pseudo-UTF-8 representations and convert them to the same Unicode output. This provides a way for information to leak past validation routines designed to process data in its eight-bit representation.

That's kinda like saying the disadvantage of living in houses is that a poorly built house can crumble over you... --TiagoTiago (talk) 05:23, 23 July 2009 (UTC)

It's the manifestation in the software realm of an inherent flaw (though a relatively minor one) in the basic design of UTF-8 -- namely, allowing the possibility of "overlong forms" or alternative representations of the same Unicode codepoint... AnonMoos (talk) 13:32, 23 July 2009 (UTC)
It is UTF-8 specific in that you could invent an 8-bit encoding where there were no erroneous sequences that could be misinterpreted. This would not have the self-synchronizing and easy recognizability of UTF-8, but it would not have this problem. It is true that all multibyte encodings have the same problem as UTF-8, though perhaps UTF-8's design makes dangerous errors more possible. For instance, a partway alternative would be for the 2-byte encodings to have an offset of 128 added to the value, so the possible range would not intersect the 1-byte encodings; this would have made UTF-8 much safer, and a tiny bit shorter too. Spitzak (talk) 08:37, 24 July 2009 (UTC)
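(To make the overlong-form problem concrete, an illustration rather than a claim about any particular parser: a decoder that just masks off the marker bits will map the two-byte sequence C0 AF to U+002F '/', since (0xC0 & 0x1F) << 6 | (0xAF & 0x3F) = 0x2F. A filter that scans the raw bytes for 0x2F never sees a slash, yet the decoded text contains one, which is exactly the validation leak described in the quoted paragraph. A validating decoder avoids this by rejecting C0 and C1 as lead bytes outright, because every two-byte sequence they start is an overlong form of a one-byte character.)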
Hypothetical discussion aside, this paragraph should be kept because it's mentioned as a concern in official documents. RFC 3629 has a whole section on it (section 10). -- BenRG (talk) 08:58, 24 July 2009 (UTC)

Replacement for invalid UTF-8 sequences

FWIW, I believe that entire section of the document about replacement as an error handling strategy is either WP:OR or violates WP:NOTGUIDE unless there's a reliable cite for the "popular" or "more useful" solutions that happen to violate the "standard" method, which is to error out.

I should explain what I was trying to do in my wording change. When there's an error I move forward one byte and then scan the data stream until I get to a non "tail" byte. For example, we have 41 96 8B 8E 9B 40. It happens to be a CP1252 string "A–‹Ž›@". The possible ways to decode this are:

  • U+0041 U+FFFD U+FFFD U+FFFD U+FFFD U+0040 - this is what's in the text now.
  • U+0041 U+FFFD U+0040 - I consider the entire "96 8B 8E 9B" chunk to be an invalid sequence of bytes and there's no way we can know if it's intended to represent one, two, or more code units. I thus translated the entire chunk into one U+FFFD.

When it comes to U+DCxx-style replacement, both of them output U+0041 U+DC96 U+DC8B U+DC8E U+DC9B U+0040, and this method, or interpreting the bytes as CP1252, is likely more useful.

All of these are valid as far as I can see as they all "protect against decoding invalid sequences." I'm more concerned about the OR/NOTGUIDE aspect but on the other hand it is rather useful. --Marc Kupper|talk 01:37, 6 October 2009 (UTC)
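(A minimal sketch of the second bullet's behaviour, just to pin down what "treat the whole run as one invalid sequence" means; the helper name is made up, and the caller is assumed to emit a single U+FFFD per call.)

/* After an invalid byte at p, skip it and any continuation bytes (0x80..0xBF)
   that follow, so the run 96 8B 8E 9B above collapses into one replacement. */
static const unsigned char *skip_invalid_run(const unsigned char *p, const unsigned char *end)
{
    p++;                                 /* the invalid byte itself */
    while (p < end && (*p & 0xC0) == 0x80)
        p++;                             /* trailing continuation bytes */
    return p;
}

The first bullet's behaviour is simply "emit U+FFFD and advance by one byte", with no loop.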

I think that the text needs to refer to the Unicode Standard on how to deal with invalid UTF-8 sequences. UTR #36 (Unicode Security Considerations) Recommendation A states:
A. Always use the so-called "shortest form" of UTF-8
B. With UTF-8 (or UTF-16) conversion, never consume bytes from well-formed sequences as part of error handling
C. Avoid problematic substitutions for ill-formed substrings.
D. Never go outside of 0..10FFFF (hex)
E. Never use 5 or 6 byte UTF-8.
The reason for the recommendation to "never consume bytes from well-formed sequences as part of error handling" is explained in the UTR #36: UTF-8 Exploits, and necessitated a Conformance Change to the Standard for Unicode 5.1.0 ("Additional Constraints on Conversion of Ill-formed UTF-8"). BabelStone (talk) 09:16, 6 October 2009 (UTC)
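(To make recommendation B concrete, an illustration only: given the input bytes C2 41, the C2 announces a two-byte sequence but 41 is not a continuation byte. A conformant converter substitutes only the C2, for example with U+FFFD, and then decodes 41 normally as U+0041 'A'; it must not swallow the 41 as part of the error, which is the kind of behaviour the "UTF-8 Exploits" section of UTR #36 warns about.)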
Agreed, and for the section in the article about the parser, points B, C, and D all apply. I don't have time at the moment to think of a clean way to add this to the article text but here's the cite:
Davis, Mark (2008-07-23). "Unicode Technical Report #36, Unicode Security Considerations, Section 3.6 - Recommendations". Unicode.org. Retrieved 2009-10-06.
Use edit mode to get the {{cite web}} text. --Marc Kupper|talk 19:35, 6 October 2009 (UTC)
((edit conflict) I moved this from below my previous post back up to where I thought I was adding it, and will indent Spitzak's reply below)
Additional note, http://www.unicode.org/reports/tr36/#Ill-Formed_Subsequences can be used as a cite for the article's replacement section. We just need to get rid of peacock/unencyclopedic wording such as "popular" and "more useful" from the article and the section will be in good shape. --Marc Kupper|talk 19:59, 6 October 2009 (UTC)
If more than one byte as an error is that if you concatenate two strings, the first with an error at the end and the second with a good character at the start, the resulting translation is not equal to the concatenated translations.
The most popular one is the tempting idea that characters start with start bytes and consume all continuation bytes after them, as it makes scanning forward trivial, but this has the synchronization problem that you may have to search back an unlimited amount to find the start of the current character, and that you must look at least one byte past the end of a character to make sure it really is "good" in that it does not have excess continuation bytes.
Saying a start byte consumes exactly the correct number of bytes, no matter what, requires searching back to the start of the string.
Saying a start byte consumes up to the correct number of continuation bytes would limit backwards searching to a finite span, I think. There still is the problem of what to do with a bare continuation byte. Treating each byte as an error makes these the same case as others.
As you point out, anything that tries to preserve more information other than whether there was an error is impractical unless you do one byte at a time, so you can limit the number of different error replacements to 128.Spitzak (talk) 19:47, 6 October 2009 (UTC)
“If more than one byte as an error…” I tried several possible interpretations and am afraid I can’t parse the first sentence. Could you please restate it? Thanks.
I mean replacing more than one byte with a replacement character. Ie the idea of a leading byte always consuming the next N bytes and if they are not continuation bytes, calling that an error. The alternative is to only call the leading byte the error and then continue parsing with the next one.Spitzak (talk) 19:58, 7 October 2009 (UTC)
Thanks, that makes sense. Another alternative I was using was to treat the leading byte and all following continuation bytes as one error. This makes most sense when using U+FFFD replacement. It's a little more lossy than a stream of U+FFFD replacements but never results in a loss of valid sequences. A byte at a time U+DCxx replacement is better all around. --Marc Kupper|talk 02:02, 8 October 2009 (UTC)
“The most popular one…” – There are both good and bad points to this method. It can result in a valid UTF-8 decoder.
The main problem is that finding the start of a character from a random pointer may involve reading back to the start of a string. Also concatenation where the second string starts with continuation bytes can turn the last character of the first string into an error, or merge two errors into one. And you need to define what happens with a leading continuation byte. Spitzak (talk) 19:58, 7 October 2009 (UTC)
I had assumed from your description that it was a system that would read the header plus all continuation bytes into a buffer which was then parsed in more detail. The beginning of the buffer is the start of the string and there's never a need to "read back." But, do people actually code this to the point that it's called "popular" and "tempting?" It seems to add extra complications to the point that I can immediately see I would not want to go down that path though someone can still construct a valid UTF-8 decoder this way. --Marc Kupper|talk 02:02, 8 October 2009 (UTC)
The wording I’d used, and I’m fine with the revert, was based on consuming the head byte and then, if the head defines a 2, 3, or 4 byte sequence (5 and 6 are invalid), to peek forwards that number of bytes. If there are any errors (head bytes found in the expected tail, overlong encoding, result would be greater than 10FFFF, an invalid code point, etc.) then I was suggesting to kick out a replacement for the head and then to peek forwards and consume continuation bytes looking for the next head.
I think what is required is to treat each of those continuation bytes as another error. Consuming all continuation bytes means that excess continuation bytes cause an error, and this would have the concatenation problem, and also means that finding a character boundary may require an infinite search backwards. Consuming only up to N continuation bytes I think avoids this but is complex, and you still need to define what to do with extra continuation bytes. Both will merge errors if you concatenate strings and require a definition of what to do if a string starts with a continuation byte. Spitzak (talk) 19:58, 7 October 2009 (UTC)
I did not see that it's required to kick out a replacement for each byte of an invalid sequence. For example, let's say someone is doing U+FFFD replacement and they hit a mechanically valid UTF-8 sequence that contains an invalid Unicode code-point. It's common to replace this with a single U+FFFD as that's exactly what U+FFFD is intended for. Conformance clause C7 even allows us to ignore the sequence. --Marc Kupper|talk 02:02, 8 October 2009 (UTC)
However, the old/current wording of “A more useful solution is to translate the first byte to a replacement and continue parsing with the next byte” eliminates the need for that scan for the next head code as the parser would see and deal with any extra continuation bytes as it runs across them.
No matter what, decoding UTF-8 can be painful if someone is unable to peek forwards, push back, or retain at least a three byte state to deal with an invalid 4-byte sequence. --Marc Kupper|talk 21:18, 6 October 2009 (UTC)
This is avoided with one-byte errors. If you know you are at the start of a character (where an "error" counts as a character) then you only need to look forward the number of bytes defined by the lead byte (or only that byte if it is a continuation). If you don't know you are at the start you can check if the current byte is not a continuation, in which case you are at the start, or if it is a continuation you need to look back up to 3 bytes trying to see if you find a valid encoding that includes the current byte, in which case you are pointing into the character, otherwise you are pointing at an error. This requires no state information.Spitzak (talk) 19:58, 7 October 2009 (UTC)
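(For what it's worth, a sketch in C of the "find the start of the character containing p" operation described above, assuming errors are treated as single bytes; it deliberately checks only the byte structure, not overlongs or code point ranges.)

/* Back up from p to the start of the character (or error byte) that contains p.
   With single-byte errors the search is bounded: at most 3 continuation bytes. */
static const unsigned char *utf8_char_start(const unsigned char *p, const unsigned char *begin)
{
    const unsigned char *q = p;
    int back = 0, len;
    while (q > begin && back < 3 && (*q & 0xC0) == 0x80) { q--; back++; }
    if ((*q & 0x80) == 0)         len = 1;
    else if ((*q & 0xE0) == 0xC0) len = 2;
    else if ((*q & 0xF0) == 0xE0) len = 3;
    else if ((*q & 0xF8) == 0xF0) len = 4;
    else                          len = 1;  /* stray continuation or invalid lead: an "error" byte */
    return (q + len > p) ? q : p;           /* otherwise p itself is an error byte */
}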
What you outlined does require state information as it assumes one can "look forward" at least three bytes. Those three bytes are your state information. If you are dealing with a system that does not support look forward by at least three bytes but does support pushing back at least three bytes then whatever gets pushed back becomes the state information. If you are dealing with a system that does not support looking forwards or pushing back then the UTF-8 decoder must be able to store at least three bytes of state information. If you are dealing with a system that does not support looking forwards, pushing back, or storing of state information then you can't write a perfectly up to spec UTF-8 decoder. You can still write a UTF-8 decoder that'll work just fine on valid UTF-8, but recovery from a mechanically invalid section and re-syncing may result in valid codepoints being lost. The details of possible failure modes in the latter situation are implementation specific and far beyond the scope of this Wikipedia article. FWIW, the latter situation is common with hardware-only decoders or firmware based decoders that don't have three bytes of state storage available for the UTF-8 decoder. --Marc Kupper|talk 02:02, 8 October 2009 (UTC)
One other comment as I had to think about "so you can limit the number of different error replacements to 128" for a bit. I'd done an edit yesterday to change the suggested range from U+DC80..U+DCFF to U+DC00..U+DCFF but now see that the only possible replacements would have the high bit set meaning someone could bit-OR the value into DC80 and then read the lower eight bits to get the original value back. When I changed it to DC00 it does not break anything but allows for using addition or bit-OR to insert the value. However, as all of this should be based on reliable sources I assume we should use whatever those sources say in terms of DC80 vs. DC00. I assume the DCxx range was chosen so that an upper code layer that (wrongly) assumes this is UTF-16 encoded data would handle this correctly as that code would see the DCxx and know it's not part of a surrogate pair. --Marc Kupper|talk 21:36, 6 October 2009 (UTC)
DCxx was chosen because they are UTF-16 lower surrogate halves, and thus an invalid UTF-8 encoding will turn into an invalid UTF-16 encoding in most cases (it will not if the error is preceded by a UTF-8 encoding of a UTF-16 upper surrogate half, unless you count that as an error, which I very much recommend against as it will break all the people using CESU encodings and make it impossible to name Windows files with UTF-8 because they allow invalid UTF-16).
I have suggested that DC00-DC7F are a good idea for "quoting" ASCII letters. For instance if you really want a slash in your filename but the file system uses slash for some other reason, the character 0xDC2F (0x2F is slash) could be used instead. This would be consistent with this handling of invalid UTF-8. But this is certainly original research so should not be mentioned in Wikipedia.Spitzak (talk) 19:58, 7 October 2009 (UTC)
0xDC2F would be pretty evil but yes, a cool idea. --Marc Kupper|talk 02:02, 8 October 2009 (UTC)
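(To tie this back to the earlier example: with the U+DCxx scheme the stray byte 0x96 becomes U+DC96, i.e. 0xDC00 + 0x96, or equivalently 0xDC00 | 0x96, and a later stage that treats the result as strict UTF-16 will flag it as an unpaired low surrogate rather than silently accepting it.)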

Scanning backwards

I'm confused by the regular references above to the issues surrounding scanning backwards. Why would a decoder need to scan backwards? The only times I could see a need to back up are

  1. The rules of the game are that 1) We are going to put you at a random spot in what potentially is UTF-8 encoded data. 2) If that random spot turns out to be within what seems like a UTF-8 section then you need to decode that section and return its Unicode code-point value if possible. I suspect if someone's dealing with those rules that there are many other rules and it's an implementation specific thing that's likely out of scope for this article.
  2. Your system does not support look-ahead but does support push-back. In that case, you could get two or more bytes into a decode, realize it's an invalid sequence, and you want to then output U+DCxx bytes for each of the bytes in the invalid sequence. In that case you need to move backwards up to four bytes. You already know exactly how many bytes you need to back up to get to the head of the sequence. No scanning is needed. If you have a one byte buffer available then you can store the head and only need to back up one to three bytes. You then output/return U+DCxx where xx is the buffered head byte. --Marc Kupper|talk 02:02, 8 October 2009 (UTC)

When I use the term "push-back" it usually means one of two things. 1) The source UTF-8 data is on a device that has the concept of a byte position pointer and allows us to reposition that pointer to a byte boundary. A disk file is a typical example of this. 2) We are receiving UTF-8 data from a device. Normally we only read or receive bytes. However, this device (or its driver) allows us to push at least three bytes back into it so that they are later read back FIFO compared to the bytes we pushed. Windows pipes are like this but I don't think Unix pipes support push-back.

Some devices do not support look-ahead or push-back. Network sockets (Windows or Unix) and TTY/serial ports (Windows or Unix) are examples of this. Sometimes I deal with devices that usually allow look-ahead or push-back but in specific cases won't. I make the assumption that they never support these and instead use a three-byte state storage.

I am fairly certain that all pushback schemes are done locally; they do not somehow modify the original device. The only difference between sockets and files is that the library reading them is not supporting pushback. Most C FILE implementations can push back quite a few bytes, but the C standard only supports 1.
There are a number of UART chips that support push back (sometimes up to eight characters) but you are right that generally pushback is implemented locally. --Marc Kupper|talk 08:24, 9 October 2009 (UTC)

With none of these would I ever need to scan backwards. --Marc Kupper|talk 02:02, 8 October 2009 (UTC)

This had nothing to do with pushback. It is impossible to correctly read UTF-8 with an API that returns single glyphs without pushback, as you will have consumed at least one more byte before you detect that an error has occurred. However I very much doubt there are any APIs that do one character at a time so this is irrelevant. Also conversion on reading from a serial source is very bad because the data must pass through a lossy conversion which can lead to security holes or DoS errors. Conversion should always be done from a source you can recover if the conversion fails.
The usage I am trying for, and that I have the most experience with, is an editor keeping a buffer full of 8-bit text. I only need 3 operators: 1-char forward, 1-char backward, and "go to start of character we are pointing at a byte of". The last one is used to implement 1-char backward, and to move a pointer to a random byte so that you know it is pointing at a character, and is extensively used by complex buffer operations such as regexp and locating the character near the mouse. I consider it a requirement that this operation work with a maximum of a finite amount of searching in each direction, and only by treating errors as single bytes is this possible. Spitzak (talk) 22:21, 8 October 2009 (UTC)
There are systems that allow you to look at the data without consuming it. Those don't need pushback. I see a difference in your world and mine here. I deal with communication and interfaces. I do a fair amount of data validation and transformation as part of this. Backwards and forwards often times do not exist in my world. I can't look at the next byte as it does not exist yet and if something needs to be retained about the past I need to have done it myself at that time.
Your comments about the backwards scan now make perfect sense to me. To make this conversation useful to the Wikipedia article we should see if there are reliable sources that explain forwards and backwards synchronization and some of the issues a person may run into. It'll expose both pros and cons of UTF-8. --Marc Kupper|talk 09:38, 9 October 2009 (UTC)

Advantages/Disadvantages

Somebody added two "who?" citations. The "who" is "people who keep posting incorrect information to this page" and decidedly NOT an "informative source". I don't know how to correctly state this but we certainly do NOT want a reference to somebody saying this.

The one about CP1252 is an attempt to stop the continuous posting of "UTF-8 sucks I can do everything with 1-byte codes" posters. They do, from their pov, have a complaint, so I think a paragraph is deserved there.

The one about "fixed size" is to stop the repeated posting of "strlen(x) is really slow with UTF-8". These people have not tried programming with UTF-8 and are making very dangerous and incorrect assumptions about how it should work. These people are dangerous in that they are usually smart enough to actually implement what they have in mind, resulting in very slow, lossy, and insecure software. This has been edited repeatedly by me and many others to delete or correct any indication that distances in a string somehow must be measured in "characters", but for a while it was almost continuous maintenance because so many "helpful" people kept trying to "fix" this. It has slowed down some, possibly because Windows programmers are realizing that UTF-16 is variable length yet Windows uses "number of words" to measure the length, but I also think the current wording pointing out that it is the fault of ASCII-era documentation has helped as well. Therefore I would like to keep the wording as it appears to be discouraging incorrect edits. The paragraph itself must remain as a disadvantage, as it is true that "number of characters" is harder to determine with UTF-8, but I want to make sure it is clear that this is a MINOR disadvantage and not the "makes programming impossible" problem some think it is.

On further thought it does seem the whole advantages/disadvantages section is too long and pointless. People are well aware of the advantages, and the only alternative even considered today is UTF-16; all single-byte and alternative multi-byte encodings and UCS-2 are effectively dead. There are, however, some useful facts and information and references in there that may want to be preserved:

  • A lot of API's for ASCII require zero changes to work with UTF-8
  • The low chance of confusion with other encodings, thus eliminating the need for BOM
  • The self-synchronizing feature
  • byte-oriented search algorithms work
  • Invalid UTF-16 can be stored losslessly as UTF-8 but the converse is false!
  • Please make it absolutely clear that strlen() does not return "characters" and that this is NOT a problem!
  • add a "usage" section
  • add a "size" section (it's a silly subject but lots of people want to know)

Spitzak (talk) 22:21, 8 October 2009 (UTC)

A straightforward solution is to reference reliable sources. If someone adds or restores stuff that's not backed by reliable citations then just rip it out and point them to WP:BURDEN. If someone can find a reliable source that says "UTF-8 SUCK DICK" then they are free to add it to the article though as that appears to be a NPOV statement the article should also name the standards agency or whatever that made that statement. :-)
I agree and support the intent of the Advantages/Disadvantages section but as it's largely not backed by references it's encouraging people to add their own pet theories and opinions as that seems to be the style of that section.
The revised wording still needs "who?" tags as all you did was to remove the explicit "people" and made it implied people. When I see "This is often mistakenly considered important..." I'm immediately thinking "who often mistakenly considered...?" People often try to weasel "authority" into their statements with wording like that, which is why it gets tagged. I'll think about a way to reword this so that there are no implied people. I agree with the concept but not how they are currently expressed.
I'm puzzled by your comments about strlen(). Can you show me a document where it does not say it returns "characters?" The versions that deal with fixed width characters have the edge as most CPUs offer hardware support to scan for the NUL. strlen() on variable width encoding systems is generally not supported by the CPU hardware, meaning that it's a software coded scan. With UTF-8 you are looking at each head byte plus you will need logic to resync if needed. I would not be surprised if it's an order of magnitude slower. --Marc Kupper|talk 08:19, 9 October 2009 (UTC)
You are making exactly the same mistake. strlen() is trivially implemented in UTF-8, because it should return the number of bytes before the NUL. Yes lots of documentation calls them "characters" but they also call the argument to malloc() "characters". It is obsolete documentation. Anybody who thinks strlen() should return anything other than fixed-sized units is wrong. Sorry, there just is not any other way to say it. —Preceding unsigned comment added by Spitzak (talkcontribs) 17:04, 9 October 2009 (UTC)
Though some misguided person has dutifully tried to make the gnu documentation say "characters" and "bytes" at various places, it did not take very long to find a place where it is obvious the documentation makes no sense unless you say they are equivalent. The documentation says that snprintf returns the "number of characters" but accepts "number of bytes" and that "if the return value is greater than the passed number then the result was truncated", clearly meaning that they are the same units. I'm sure there are better examples with just a little searching. Also check what all Windows APIs do with UTF-16, or what Python claims the len of a "Unicode" string containing non-BMP characters is. The implementations are correct, the documentation is wrong. Spitzak (talk) 17:25, 9 October 2009 (UTC)
Dug up the official Posix documentation. They have fixed it correctly, strlen() has been reworded in the last few years to read "compute the number of bytes in the string to which s points, not including the terminating null byte."
This is also the wording of the BSD documentation from running man on a BSD server here.
MSDN documentation also clearly says that there is a different function called "_mbslen" that "returns the number of multi-byte characters", implying that strlen() returns something other than "multi-byte characters", also wcslen() returns the number of UTF-16 words, not characters. —Preceding unsigned comment added by Spitzak (talkcontribs) 17:34, 9 October 2009 (UTC)
Reference to the POSIX standard that seems to be a legit and usable reference: http://www.opengroup.org/onlinepubs/000095399/functions/strlen.html. Quote: "The strlen() function shall compute the number of bytes in the string to which s points, not including the terminating null byte."Spitzak (talk) 19:23, 9 October 2009 (UTC)
Fair enough - I'll need to do some research and testing. I checked a Microsoft C v7 (1989) manual and they document strlen() as returning bytes and malloc() as using size_t. Since at least 2001 to today they say strlen() returns characters and you use things like mblen() to get the byte counts. strlen() for ANSI C (1989 edition at hand) returns characters. It's clear though it's been a point of contention. One argument about UTF-8 is that it's not a char array and so strlen() is irrelevant. Code using it to get the byte length of a UTF-8 string is not portable as it makes assumptions about both the behavior of strlen() and sizeof(char). --Marc Kupper|talk 22:29, 9 October 2009 (UTC)
I think you have strlen and mblen reversed. strlen on Windows returns the number of bytes, mblen is the one that actually decodes the bytes and figures out the number of "characters".Spitzak (talk) 23:40, 23 November 2009 (UTC)
I took a break and read the ANSI C89 standard (doc X3.159-1989). I never got a copy of C99 but in thinking about it now, I should. Most of the standard dances around the subject of bytes. The char type section says nothing for example. However, much of the C library depends on size_t which depends on sizeof() and section 3.3.3.4 The sizeof Operator states that it "yields the size (in bytes) of its operand." It further goes on to say that when sizeof is applied to a char, unsigned char, or signed char, (or a qualified version thereof) is that the result is 1. Thus it all hinges on that section. Curiously, the section then goes on with examples that uses the alloc() function which is not documented in this standard. Thus ANSI C89's strlen() returns bytes. --Marc Kupper|talk 23:33, 9 October 2009 (UTC)
I believe I'm done thinking about strlen(). strlen() can safely be used to return the length of a UTF-8 string in bytes. Some people will consider it "abuse" because 1) UTF-8 seems to be oriented for 8-bit bytes and a char may be larger. However UTF-8 works fine on systems with more than 8 bits per byte (and character) as you are only working with the lower 8 bits of each byte. 2) To make the bit manipulation code easier to read some people will use unsigned char for UTF-8 data and thus need to cast it to (char) when dealing with functions like strlen(). Casts are evil in the eyes of a purist. :-) 3) Some systems document strlen() as returning the length in characters. You are inviting user confusion to use strlen() on UTF-8 data to get the byte length. This will be particularly true on those systems that group strlen() with other functions that return the character and not byte count for the various string types supported by that system. --Marc Kupper|talk 02:02, 11 October 2009 (UTC)
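(As an aside, a minimal sketch in C of the two counts being discussed; not any particular library's API, and it assumes the string is valid UTF-8.)

#include <stddef.h>

/* Code points in a NUL-terminated UTF-8 string: count every byte
   that is not a continuation byte (10xxxxxx). */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if ((*s & 0xC0) != 0x80)
            n++;
    return n;
}

For "é" (the two bytes C3 A9), strlen() returns 2 while utf8_codepoints() returns 1; which of the two a caller actually needs depends entirely on what it plans to do with the number.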
The real problem is the absolute total uselessness and impracticality of making strlen use anything other than fixed-sized units. This should become blatantly obvious if you actually try to write some software to manipulate UTF-8, rather than just considering it as some piece of data whose only purpose is to be converted into UTF-16 and thrown away.
A "character" is extremely vaguely defined; certainly this article and the Unicode one expend thousands of words trying to define it. For far too many people it really means "UTF-16 code points", which should make it perfectly obvious that they are wrong (those are no more "characters" than UTF-8 bytes are). But there are all the questions about handling invalid byte sequences, and if it is valid UTF-8, how to handle noncharacters and invalid code points, and how to handle combining diacriticals and non-spacing characters and invisible codes and control characters. It should be obviously insane to try to define low-level data manipulation routines to use such a complex and ill-defined unit.
The huge problem is the belief that somehow "number of characters" is important. Beginning programmers constantly ask "but a call to move 10 characters forward will be slow" and refuse to come up with an actual description of why they ever would want to do this and what code magically produced this number "10". In reality "10" is only produced by examining the very same string and saying "this interesting point is at a location such that my "move by N characters" function will arrive at it if you give it the number 10". Any possible argument falls apart then because you can just change *both* the producer and consumer to use byte counts and get exactly the same result.
It is really sad that what appear to be obviously intelligent programmers turn into complete morons when they are given UTF-8, likely due to ages of being exposed to documentation that says strings are measured in "characters". One hope is that the UTF-16 crowd is finally learning, due to Windows unambiguously using 16-bit word arrays as UTF-16 and not any other encoding, that measuring using code units is the only way to make it work. Certainly I have seen a shift from shock and disbelief that came out a year ago at the "wrong strlen" returned for non-BMP UTF-16. Perhaps they are getting smarter. I can only hope it is not too late.Spitzak (talk) 17:14, 12 October 2009 (UTC)

Meaning of underlined bits in first table?

The table after the first paragraph of the section titled "Description" has various fields of bits underlined in the examples. The underlining seems random; although only non-control bits seem to be underlined, I can see no pattern as to which non-control bits are underlined, and there is no explanation.

What does this underlining mean? If it has meaning the meaning should be explained in the accompanying text. If it has no meaning it should be removed. -- Dan Griscom (talk) 12:03, 12 October 2009 (UTC)

It is an attempt to show what bits are controlled by the hex digits in the Unicode number. There were other attempts using color and italics and this seemed to work the best, however it is hard to say if it really is working at all and maybe it should just be removed.Spitzak (talk) 16:58, 12 October 2009 (UTC)

Lossless conversion of errors

I have to keep reinserting the advantage that errors in UTF-16 can be stored in UTF-8, but the converse is false. This is the ENTIRE reason I am being such a pain in the ass about keeping this page up to date and removing incorrect arguments against UTF-8. I NEED UTF-8 APIs to libraries and systems; UTF-16 APIs are USELESS because of this fact. I'm sorry I can't find a good reference but believe me this is incredibly vital! Spitzak (talk) 23:51, 15 January 2010 (UTC)

Precomposition and Decomposition

I've proposed that this section be moved to Unicode equivalence, the article dealing with Unicode normalisation. The problems that the section describes are really nothing to do with encoding; any encoding implementation that arbitrarily alters precomposed or decomposed characters will cause the same issue. Similarly, the statement that "Correct handling of UTF-8 requires preserving the raw byte sequence" is incorrect. If my (UTF-8) application uses NFKD, then it will be quite "correct" for it to not preserve incoming UTF-8 byte sequences. Any arguments? -- Perey (talk) 17:21, 18 February 2010 (UTC)

Sure. Precomposition of characters is a matter based on the concept of code points and, hence, lies above the encoding level (which is situated logically immediately below the code points). Incnis Mrsi (talk) 18:23, 18 February 2010 (UTC)
I agree that this has nothing to do with UTF-8. The raw byte sequences thing was my attempt to make it relevant but I would prefer that this be deleted from here. Spitzak (talk) 20:07, 18 February 2010 (UTC)
What I meant to say about correct handling is that the Samba bug is due to somebody trying to be too clever. People have to treat text as raw data and not try to give it magical properties when it is not necessary. This thinking is the source of the majority of bugs when dealing with Unicode. For UTF-8 I believe this means that as much as possible you should treat it as a string of 8-bit bytes, not as "characters". You only need "characters" if you are trying to render it on a display, or doing analysis that needs EVERY character (not just searching for a slash or something) such as spelling correction. If this is moved to the Unicode page I recommend the wording be changed to something like "changing the normalization is not a good idea unless you really, really, really know the destination will crash when given the wrong normalization". The Samba bug is an excellent example where the "stupid" program would do the right thing. —Preceding unsigned comment added by Spitzak (talkcontribs) 20:12, 18 February 2010 (UTC)
I see what you're saying, Spitzak; ultimately the problem comes down to two applications, both relying on the same data, but one arbitrarily changing its normal form and the other being unable to handle that form. Encoding makes no difference; you'd get the same problem if both ends used (say) UTF-16, if they still couldn't accept one another's normal forms. I've moved the section to Unicode equivalence and made some edits, including retitling the section to "Errors due to normalization differences". And I've changed the last sentence to sum up the problem thusly: "Applications may avoid such errors by preserving input code points, and only normalizing them to the application's preferred normal form for internal use." I'm not sure it's a very clear summary; if anyone would like to try and do better, please do! -- Perey (talk) 17:16, 19 February 2010 (UTC)

Measuring a string length

Spitzak removed[3] a statement. It is not important whether some WP editors think that measurement in Unicode code points is necessary, or do not, or even consider it harmful. It is used. Somewhere bytes may be preferred (offsets in memory) or visible characters (in text editors). But the default method in many languages (e.g. Perl or JavaScript) is to count code points.

Perl example:

#!/usr/bin/perl -n
chomp;
print "length(\"$_\") = ".length($_)."\n";

JS/HTML example:

<html>
<head><title>String length calculator</title></head>
<body>
<form>String:&nbsp;<input type="TEXT" name="s" size="160"/></form>
<p><a href="javascript:alert('length(\u0022'+document.forms[0].s.value+'\u0022) = '+document.forms[0].s.value.length);">
compute size</a></p>
</body>
</html>

Result, both in Perl with proper locale and in JS:

length("…") = 1
length("…̄") = 2

Incnis Mrsi (talk) 08:34, 30 March 2010 (UTC)

Perl does appear to be screwed up. They are not using UTF-16 but instead trying to simulate it while keeping the original text in UTF-8. That is a really bad idea. They do appear to realize that: http://perldoc.perl.org/perlunicode.html#Speed I think it is also instructive to look at how they did regular expressions, note that they are able to do quite complex matches against UTF-8 without having to decode more than one "character" at a time.
Your JavaScript example seems to show they measure the bytes in UTF-8 which is correct behavior.
Your suggested rule to just count all leading bytes and ignore all continuation bytes will not return the correct number of "characters" if the UTF-8 string contains any erroneous sequences. As "error" is pretty hard to define (do UTF-16 surrogate halves count?) there is in fact no actual solution. Even if you know the string contains no "error" according to the rules of your application, exactly what is wanted? Typically users want the number of UTF-16 words (because really the only possible use for this answer is to allocate a buffer large enough to hold the conversion), but that is rather bogus to call a "character". If for some mysterious reason you really want to determine the number of individual glyphs that will be drawn, you will quickly find that is a very complex operation that depends on many language rules and is really best left to a text rendering library.Spitzak (talk) 18:55, 30 March 2010 (UTC)
My example of JavaScript is, strictly speaking, off-topic. JavaScript seems to use UTF-16 internally, not UTF-8 (at least for the environment where I tested it, and it is not Windows), and measures 16-bit "bytes", not actual code points, so:
  • in JS: length("𐌰") = 2,
  • in Perl: length("𐌰") = 1.
I did not "suggest" that rule about omitting 0x80–0xBF. I noted that your comment ("misleading text that implies that strings must be measured in "characters"") was misleading itself. There are applications which, for some reason, attempt to count code points (or 16-bit words of UTF-16, which is the same for BMP). Any such quantity differs from byte size in UTF-8. There is an algorithm to count codes in UTF-8, and you removed the mention of it. For invalid strings the number of code points is undefined and you should not blame that algorithm for returning something different from your personal expectations. Incnis Mrsi (talk) 22:19, 30 March 2010 (UTC)
What was the previous JavaScript example with the ellipsis character? You said it returned 2, but if it is UTF-16 then it should return 1 (unless it counts 16 bits as 2, but then your second example should return 4). Your second example implies that JavaScript is doing UTF-16 correctly and considers a non-BMP character to be 2 units long (not a surprise, most UTF-16 is done correctly, i.e. Windows APIs and Python on Windows and TCL and several other languages count each surrogate half as a unit. For reasons I can't quite figure out, this strange brainwashing of programmers seems to be for UTF-8 only, not UTF-16 and not the older multibyte encodings). Spitzak (talk) 00:38, 31 March 2010 (UTC)
Look carefully, please. Length of the "…" character alone was 1. Length=2 was for the ellipsis with a combining macron (U+2026 U+0304), which occupies the same space in a text. What does this example demonstrate? That there is apparently no notion of character (in Perl or in JS) different somehow from a unit of code, and that such a unit is not an octet. Perl length, obviously, differs from size in any encoding except UTF-32. Also, note that size is a clearly defined term in the context of a given encoding – it is a size in memory, the offset of the end of data. But the length is not a synonym and is not defined unambiguously – number of units… what units? An unregistered user wrote about the length, not size. Brainwashing or not, Perl's approach exists. There are multiple definitions of Unicode length, and you may not suppress some that you dislike in favour of your preferred one. Incnis Mrsi (talk) 09:34, 31 March 2010 (UTC)
We may be confusing things. Except for Perl all the examples are returning what you call "size". A system using UTF-16 should return 2 for a non-BMP character, which so far all your examples do. The two-Unicode code point result should return 2 in UTF-16 and 4 in UTF-8 (assuming both code points are less than 0x800). The Perl result shows "how many times a Unicode composing iterator must be called to reach the end of the string", which is what you want to call "length". I don't like this as the word "length" is used far too often for "how big a varying object is in memory" while "size" is usually reserved for a fixed-sized object. Also you cannot have a "length" as an integer without binding it to the definition of the iterator that produced it, so I very much want to discourage API's that assume this is an integer.Spitzak (talk) 19:24, 31 March 2010 (UTC)


wikipedia css competing with this page

Setting my Unicode/UTF-8 fonts to DejaVu, I can't view the example glyph in the 4th column of the table in the description (it looks like Japanese kanji, but my browser generally shows that just fine). I'm guessing the problem is that the Wikipedia style sheet specifies a font by name, because if I change that font to Impact the page still doesn't change. This is what the glyph is supposed to look like: http://www.fileformat.info/info/unicode/char/24b62/index.htm —Preceding unsigned comment added by 173.66.61.35 (talk) 16:44, 17 May 2010 (UTC)

Please provide a better table of allowed ranges.

Hello. The tables currently displayed are difficult to digest. You must calculate the outcome instead of just reading the info. A much better representation would be like this: [start]-[end], [start]-[end] ranges. What happens now is this: there is a range 128-2,047 in the table, which is very confusing because 128 is not a valid Unicode, and 128 is not represented in binary in two bytes. 128 is a single byte of 10000000 (bits). That is 162 != 0xC2A2. I mean, what happens is that instead of explaining how 162 transforms into 49826, you put 162 on both sides of the equation. That is, you say 162 = 162 instead of 162 = 49826 & 49664 (which is how you can calculate it and discover what is the value it maps to). Wvxvw (talk) 18:25, 22 May 2010 (UTC)

I'm not sure what you are saying, but 128 certainly is a valid Unicode character, and it certainly is two bytes in UTF-8. I think you are thinking that all characters 0-255 are encoded as one byte; this is false, only the characters 0-127 are encoded as one byte. I do not understand your 162 example at all. Spitzak (talk) 22:42, 22 May 2010 (UTC)
No, 128 is a valid codepoint; this is different from the integer value and how integers are commonly stored. This is furthermore confusing because one would read and understand 128 as 10000000 (binary) or 0x80 (hexadecimal), and only when you encode a string using a Unicode encoding will the value in binary consist of two octets. That is, if you write it like so: U+0080, then it is understood as a codepoint, but once it is just a decimal, then it is confusing. Another confusion is added because most high-level programming languages, which use Unicode to store strings, would convert integers to characters using codepoints (instead of their integer values). Wvxvw (talk) 15:17, 23 May 2010 (UTC)
I agree it probably should not show the decimal numbers under the U+nnnn descriptions.Spitzak (talk) 19:50, 23 May 2010 (UTC)
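(For reference, the arithmetic the table compresses, as an illustration of the point above: U+0080, decimal 128, is 000 1000 0000 in 11 bits, so it encodes as 110 00010, 10 000000, i.e. the two bytes C2 80; likewise U+00A2, decimal 162, encodes as C2 A2. Decoding reverses this: (0xC2 & 0x1F) << 6 | (0xA2 & 0x3F) = 0xA2 = 162. Reading the two bytes as a single big-endian integer is what gives the 0xC2A2 = 49826 figure mentioned above.)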

If you want a complete list of ranges of code points that will display on current browsers, even that list would be massive. There are also odd invalid codes in between valid ones. The small number shown with MS Character Map is just a small fraction of the code points that will display, especially if you add Chinese. There are also a large number of characters such as pF as a single code point, for picofarads, together with dozens of other electrical symbols, Roman numerals to XII in lower and upper case as one code point, the alphabet (English only, luckily) encircled, digits 1-20 encircled and so forth. I have so far tried everything up to 0xFFFFF on several character sets in common use and on three browsers, Word, Publisher and Notepad, and all agreed. (I wrote a short program to do it.) If that is the sort of list you had in mind, leave me a note and give me a couple of weeks and I will list them for you. Euc (talk) 01:40, 6 July 2010 (UTC)

UTF-8 vs UTF-16 file size

I have serious doubts about this section of bulleted text, currently in the article to detail a disadvantage of UTF-8 that's supposedly rare in practice:

  • Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi could take more space in UTF-8 if there are more of these characters than there are ASCII characters. This rarely happens in real documents, for example both the Japanese and the Korean UTF-8 article on Wikipedia take more space if saved as UTF-16 than the original UTF-8 version

It has this footnote...

The version from 2009-04-27 of ja:UTF-8 needed 50 kb when saved (as UTF-8), but when converted to UTF-16 (with notepad) it took 81 kb, with a similar result for the Korean article

...and for completeness, the footnote has this {{Clarify}} note.

This should be done with something other than notepad, with a program that doesn't mangle newlines.

My problem with this claim isn't the use of Notepad; it's the claim that "This rarely happens in real documents". The supposed proof of this fact is that a certain HTML page took more space as UTF-16 than as UTF-8. HTML just happens to be a format that uses ASCII for its structuring, as is most XML. But this doesn't mean that "real" documents don't use unmixed non-ASCII characters! Any file format that doesn't use ASCII for its non-content parts (plain text, binary document formats) would absolutely take more space in UTF-8 than UTF-16, if their primary script were in the U+0800–U+FFFF range. Now, if you can prove that "real" plain-text or binary-format documents in Chinese, Japanese or Hindi scripts are "rarely" found... -- Perey (talk) 13:58, 27 May 2010 (UTC)

Though XML markup hugely skews things toward giving UTF-8 an advantage, even plain text has lots of spaces and newlines, quoted English, examples of computer code, Arabic numerals, ASCII punctuation, etc. By far the most lengthy plain text documents even in China are huge tables of numbers, and these are pure ASCII. Spitzak (talk) 17:39, 27 May 2010 (UTC)
I'll pay your point about newlines, and to an extent spaces and ASCII punctuation (only "to an extent" because other blocks often define their own distinct punctuation and may have different spacing rules). I can certainly see that you could well be right about the content of large text files making smaller text files (personal communications and the like, those having no quoted English and nothing to do with computer code) an utterly insignificant minority. But it's not referenced. I'd still really like to see some sort of citable source that supports us saying that UTF-8 "byte bloat" is not really a problem for those using the scripts that it theoretically affects (most of Asia and much of Africa). -- Perey (talk) 05:38, 30 May 2010 (UTC)

Here is a test: Text chosen is the Japanese Wikipedia entry for George Bush [[4]] This was chosen because I guessed it would be fairly long and not be computer-related. UTF-8 output was saved, size in UTF-8 was counted with "wc -c" (though just the size of the file would work), while size in UTF-16 was counted with "wc -m" multiplied by 2 (I assumed there were no non-BMP characters). All the files have plain LF as a newline, not CR+LF.

  • Actual html in UTF-8 207,329 vs UTF-16 354,388. That however is not very fair as there is a huge amount of HTML markup in that.
  • The contents of the "edit" panel, ie the wikipedia markup input. This is much more indicative of text somebody would type in. In UTF-8 31,317, in UTF-16 35,028.
  • To get an inverse sample, I picked the longest section in the article that had no images or tables. The results of the "edit" panel contents: UTF-8 2,401 and UTF-16 1,830.

Spitzak (talk) 19:29, 30 May 2010 (UTC)

An interesting exercise. :) It won't give us anything we can cite, since it seems like the very definition of original research, but it also seems like fun. I copied and pasted the whole article (omitting just the Wikipedia bits around the outside) into a text editor and saved it as UTF-8, UTF-16, and Shift-JIS, just for kicks. The results:
Shift-JIS 21807 bytes
UTF-16    27104 bytes
UTF-8     30076 bytes
And then I went looking for a lengthy article in Chinese about a Chinese topic, to minimise ASCII content. The family tree of late Chinese emperors (en - zh) looked promising, but thanks to the diagrams, the plain text content has a lot of white space, not to mention numbers. It looks really good for UTF-8:
Big-5  51712 bytes
UTF-16 94288 bytes
UTF-8  56926 bytes
The article on the PRC's government (en - zh) was probably a better test.
Big-5  58168
UTF-16 80880
UTF-8  89372
So, UTF-8 is 10% larger in two cases, and much, much smaller in one case. As I said, completely unciteable (and unscientific), but fun. (Yes, I have a weird definition of fun.) -- Perey (talk) 08:32, 31 May 2010 (UTC)

I should add another test that shows why all this argument is completely pointless. This is again with the "edit" text for the George Bush page:

UTF-8: 31317
UTF-16: 35028 (35032 when using iconv) (+11.8%)
bzip2 of UTF-8: 10351
bzip2 of UTF-16: 10797 (+4.3%)

The thing is that modern compression produces a far smaller file than either encoding, and since it relies on patterns it removes almost all the difference in the source encoding sizes. Spitzak (talk) 17:33, 31 May 2010 (UTC)

Note: After much staring I figured out why iconv added 4 bytes over wc -m: it added the BOM at the start, and my source had one more newline at the end. Spitzak (talk) 17:40, 31 May 2010 (UTC)
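
For completeness, the compression comparison above is just as easy to reproduce (a sketch only; the exact byte counts depend on the compressor and its settings):

    import bz2

    with open("sample.txt", encoding="utf-8") as f:   # placeholder file name
        text = f.read()

    for name, data in (("UTF-8", text.encode("utf-8")),
                       ("UTF-16", text.encode("utf-16-le"))):
        print(name, len(data), "bytes ->", len(bz2.compress(data)), "bytes after bzip2")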

It's not completely pointless as long as the article makes unsourced claims, one way or another, about whether UTF-8 really has an impact on (uncompressed) plain text in scripts above the U+0800 mark. -- Perey (talk) 09:42, 12 June 2010 (UTC)
I would like to claim that this debate is pointless, since only a small portion of data storage and communication is text. For example I calculated (not in a scientifically documented way, but still) that for this UTF-8 article in English, 86 kB is HTML tags (inside < and >), 3 kB is JavaScript and 32 kB is text, images not included. Also it is not clear which of UTF-8 or UTF-16 gives larger files. It depends. --BIL (talk) 23:20, 12 June 2010 (UTC)
I'm afraid that (as I see it) your claim isn't proving the debate pointless; it's taking part in the debate. You're offering one more perspective on whether or not UTF-8 does or does not affect data sizes. Heck, that's not even the fundamental issue here. Fundamentally (again, just as I see it), the debate isn't whether it does or doesn't; the debate is whether we can provide a citable argument for either side, or whether we should just shut up about it and take it out of the article. ;-) (And for the record, your 89KB of HTML + Javascript is still transmitted in a text encoding, even if it's not part of the article text.) -- Perey (talk) 13:55, 20 June 2010 (UTC)

Slight problem with the table

UTF-8 is not my specialty so I will leave this for someone else to think about. Code points 10FFFF through 1FFFFF are not valid codes (or are defined by Unicode for other purposes) but only require a four-byte sequence and not 5 as the table suggests. 1FFFFF would be coded as hex digits F7 BF BF BF - 4 bytes, not 5. Euc (talk) 01:21, 6 July 2010 (UTC)

I sort of agree: the table might want to show the range up to 1FFFFF. Ending it at 10FFFF can confuse people trying to figure out the bit assignments, and 10FFFF and 10FFFE might also be considered invalid characters, depending on your definition of Unicode, so why are they included? Spitzak (talk) 23:06, 6 July 2010 (UTC)
110000 through 1FFFFF are not within the Unicode code space, and therefore are not valid Unicode code points, and cannot legitimately be represented using UTF-8; and so should definitely not be included in the first table. On the other hand, 10FFFE and 10FFFF are valid Unicode code points, but just not Unicode characters, and can be represented using UTF-8, so do have a place in the table. BabelStone (talk) 23:19, 6 July 2010 (UTC)

Better four-byte example?

U+024B62 appears extremely uncommon (google "024B62") and unprintable. Can we get a better four-byte character to use as an example?

By their definition there are no very common characters in the SMP or SIP. U+024B62 𤭢 was deliberately chosen as it is a quite common and well known Chinese character (Beijing dialect for "to break", pronounced cèi), and is found in modern dictionaries of Chinese such as Xiandai Hanyu Cidian. If you google for it you will find that there are quite a few web sites that use it in running text, which is not the case for most characters in the SMP and SIP. As to being "unprintable", it is only unprintable if you do not have a suitable font installed -- and that is an issue with all 4-byte UTF-8 characters as the SMP and SIP are explicitly for less common characters, and there are relatively few fonts that cover characters in these two planes. Windows Vista and 7 come preinstalled with fonts that cover the SIP, but not with fonts that cover most characters in the SMP, so in fact this particular character is, in my opinion, a very good choice to use as an example. BabelStone (talk) 21:32, 12 August 2010 (UTC)
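
For what it's worth, the four-byte sequence itself is easy to verify (a quick sketch; any Unicode-aware language will give the same answer):

    ch = "\U00024B62"                              # 𤭢, the example character
    encoded = ch.encode("utf-8")
    print(" ".join(f"{b:02X}" for b in encoded))   # F0 A4 AD A2
    print(len(encoded), "bytes")                   # 4 bytes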

table

The table now looks great, except that it's too wide (mainly due to the small hexadecimal numbers in the red cells along the bottom). Unfortunately, most fonts have non-proportional (fixed-width) numerical digit characters 0-9, even when the rest of the font is proportional (variable width), so I'm not sure how to fix the problem... AnonMoos (talk) 18:12, 3 December 2010 (UTC)

I suppose the numbers could be removed from the invalid bytes, but they are interesting. They are missing from 0xC0 and 0xC1 already. Or they could be split onto two lines. Spitzak (talk) 20:20, 3 December 2010 (UTC)
The reason 0xC0 and 0xC1 don't have numbers is that the numbers show the lowest code point which can be represented using non-overlong sequences beginning with that start byte. 0xC0 and 0xC1 can't represent any code points in a non-overlong way, so they don't have numbers... AnonMoos (talk) 21:47, 3 December 2010 (UTC)
Was the use of 0xFE/0xFF ever considered for UTF-8? I can see two possibilities: first, they can be two new 6-byte prefixes, allowing all 32-bit numbers to be represented. Or, to keep with the pattern, they can be a 7-byte prefix and an 8-byte prefix, allowing numbers up to 42 bits. Spitzak (talk) 20:20, 3 December 2010 (UTC)
0xFE/0xFF were deliberately left out of UTF-8 from the very beginning, so as to avoid potential confusion with the Unicode characters U+FEFF and U+FFFE ("byte-order mark" and "guaranteed to be not a character"). Best to leave the 0xFE/0xFF cells blank... AnonMoos (talk) 21:47, 3 December 2010 (UTC)

Anyway, I just now thought of adding <small> tags, and it works for now as a temporary hack (at least the table now fits within a "maximized" browser-window on my computer)... AnonMoos (talk) 21:56, 3 December 2010 (UTC)

Design

Recent edits from user:AnonMoos (e.g. [5]) have pushed the notion that the UTF-8 encoding scheme was designed to avoid the byte values FE and FF (allegedly to avoid the possibility of BOM or anti-BOM appearing). But that's not listed in the design criteria, and I suspect the absence of FE and FF in Thompson's scheme for 31-bit values is quite fortuitous. Is there a reference to the contrary? -- Elphion (talk) 14:48, 8 December 2010 (UTC)

AnonMoos has now deleted [6], [7] material information from the article, and so far as I can see without good referential authority. There are at least three schemes being discussed: Thompson's original scheme for 31-bit values, the extension of that scheme to arbitrarily large values, and the restriction to the "21-bit" space by the Unicode standard. All of these deserve discussion here. In none of these cases was it a design principle to avoid FE and FF, much less to avoid the possibility of BOM in malformed sequences. (BOM can occur in malformed sequences even of the restricted 21-bit space, after all.) -- Elphion (talk) 15:47, 8 December 2010 (UTC)
No it can't (if the only malformation is disarranging start bytes and continuation bytes). And it's the purely hypothetical and speculative extension of UTF-8 beyond 6-byte sequences which is actually causing all the problems here -- what's your reference for that? AnonMoos (talk) 15:54, 8 December 2010 (UTC)
If a stream is disrupted to the extent that bytes are disarranged, one must also consider the possibility of spurious bytes. The "purely hypothetical and speculative" extension of UTF-8 beyond 6-byte sequences is not "causing all the problems". Yes, not having FE or FF in the original decreases the already vanishingly small probability of accidental BOMs in malformed sequences, but none of these schemes is foolproof against transmission errors. I don't know who developed the extension, and I had not known about it before reading about it here; but this is exactly the sort of in-depth knowledge I have come to expect from WP, and I'm sorry you feel the need to suppress it.
This is not, however, the primary point I'm raising. You've implied that the design of the standard version, as well as Thompson's, was driven by the desire to avoid BOMs. You've deleted information to bolster that. I've seen no evidence that Thompson was concerned about that, and standard Unicode was restricted to the 21-bit space for entirely different reasons. The absence of FE and FF in this particular encoding method is an additional bonus, but that's not why it was designed as it was. -- Elphion (talk) 16:36, 8 December 2010 (UTC)
You know, when UTF-8 strings are passed around and processed, and cut and rejoined, internally within an operating system or an application, then it's really not random "line noise" type garbling which is the main problem arising from such manipulations -- it's that strings will be cut apart and rejoined in the wrong places. In other words, precisely the problem of having start bytes and continuation bytes in the wrong order. The FE-FF thing was not the "driving force" behind UTF-8, but it was a final detail which fell into place, and not accidentally. AnonMoos (talk) 22:51, 8 December 2010 (UTC)

I agree that there is no indication that FE/FF were skipped in order to avoid the BOM. At the time UTF-8 was being designed I don't think the BOM value was even defined. Also, all proposed uses of FE/FF would never have them next to each other (as they are both start bytes), so confusion with the BOM would still be impossible.

I suspect the reason FE/FF were not defined is that it was not clear what the best scheme for using them would be. I can think of several:

  • Use all the last 4 codes as 6-byte prefixes of the form 111111xx, giving you all 32-bit numbers
  • The currently shown version that goes up to 42 bits
  • Increasing the sequence length by more than one: 11 bytes gives 64 bits and 2 or 3 "id" bits (allowing floating point numbers to be in the stream)
  • The previous unlimited-length version where more than one byte determines the length

Spitzak (talk) 18:55, 8 December 2010 (UTC)

Sorry, but the BOM was defined at that time (in 1992, the Unicode 1.0 printed books had been published, and ISO/IEC 10646 was already being unified with Unicode) -- and both FEFF and FFFE are conspicuously mentioned near the beginning of http://doc.cat-v.org/plan_9/4th_edition/papers/utf . And they could have easily put the continuation bytes in C0-FF and the initial bytes in 80-BF, which would have meant that FE and FF would commonly occur in UTF8 files. AnonMoos (talk) 19:23, 8 December 2010 (UTC)

Yes, BOM was already specified in 1992, but as AnonM acknowledges, it seems not to have played much of a part in the development of UTF-8 -- neither the article by Thompson and Pike linked above nor Pike's more informal account linked in our article's External links section mentions avoidance of FE and FF as a desideratum in the design. The key point of Thompson's design (and the main rationale for marking continuation bytes with "10") was the self-documenting structure of the codes, so that readers could synchronize easily with the beginning of characters; and that is precisely why it was adopted by the standard.

BOM is a minor point (though something of a headache for Unix). It is useful primarily in helping to identify the encoding, but since it is widely misused, robust programs do not rely on presence or absence of BOM, but use heuristic tests as well, as recommended by various standards. Once the encoding is established, subsequent appearances of BOM in UTF-8 sequences (whether through transmission errors or careless manipulation) are simply errors. It's not a big deal, and the self-documenting nature of UTF-8 allows easy recovery. (In fact, one rationale for making FFFE and FFFF non-characters is that applications could safely use them as markers in string processing: there was never any sentiment that they should be avoided at all costs.)

In this article, I think it's fair to mention and even illustrate the extension of Thompson's scheme, pointing out that it maintains the self-documenting feature that was Thompson's main contribution; but that for larger values (as AnonM pointed out) it introduces FE and FF as lead bytes, unlike the standard version of UTF-8. I think it's fair in the discussion of standard UTF-8 to point out the absence of FE and FF and the consequent unlikelihood that a spurious BOM will occur, but let's not belabor that, as it was never a big deal in the design of UTF-8.

-- Elphion (talk) 14:39, 9 December 2010 (UTC)

If the continuation bytes were in C0-FF and the initial bytes in 80-BF, then the code would have been just as "self-documenting" -- 10 in bits 6 and 7 of a byte would have signaled a start byte, and 11 in bits 6 and 7 a continuation byte. That would even have been a little more "logical" than the actually-adopted UTF8 scheme, since start bytes would then precede continuation bytes in the codepage layout table, just as they do in byte sequences...
I know you think that it's neat that the UTF8 scheme could (theoretically and purely speculatively) be extended beyond 6-byte sequences, but you should also recognize that there are good reasons why UTF8 stopped where it did: 1) ISO/IEC 10646 only needed 31 bits. 2) Extending sequences beyond 6 bytes would have introduced FE and FF bytes into valid UTF8 text, something to be avoided if not strictly necessary (though not a "driving force"). And 3) extending sequences beyond 8 bytes would have meant that you couldn't tell the required length of a sequence just from the value of the initial start byte alone, and so would have made the scheme significantly less "self-documenting" and robust. AnonMoos (talk) 00:17, 10 December 2010 (UTC)
P.S. As Spitzak has pointed out, there are actually a number of ways that UTF8 could have been extended to use FE and FF bytes, and I'm not entirely sure why this article should favor one hypothetical speculation over the others... AnonMoos (talk) 00:22, 10 December 2010 (UTC)

(1) It is clear from Pike's account that because of the ASCII backward compatibility they were already focused on the lead bit. So 0 meant a 1-byte code, and 1 a multibyte code -- and the natural place to distinguish lead bytes from continuation bytes is in the next bit. '0' is the obvious terminator for a sequence of '1's, so "10" is the obvious choice for the continuation marker. This rationale yields Thompson's scheme, and Pike's account gives no indication that FE or FF was a consideration at all. Your alternative scheme would have been another way to do it, but confers no real advantage -- and it lacks the transparent simplicity of Thompson's scheme. (2) The extension previously shown in the article is the only natural extension of Thompson's scheme: it preserves Thompson's design scheme and does the obvious thing when the byte-count requires more than one byte. (3) Of course Thompson and the standards committee had a good reason to stop where they did: Thompson stopped at 6 bytes because he was representing only the 31-bit UCS space; the standards committee at 4 bytes because they were representing only the 21-bit UTF-16 space. It had nothing to do with FE and FF. You can stop at whatever point you want to get the size of space you want to represent -- that's the beauty of the scheme. -- Elphion (talk) 04:36, 10 December 2010 (UTC)

Reference for the extension of Thompson's scheme?

Does anyone know who's responsible for the extension of Thompson's scheme to arbitrarily large values? It would be good to have a reference. (Was it perhaps even Thompson himself?) -- Elphion (talk) 05:03, 10 December 2010 (UTC)

What I want to know is, does any extension beyond 6-byte sequences have any real existence beyond idle geeky speculations and musings? And if not, why should such extensions be included in this article? AnonMoos (talk) 06:57, 10 December 2010 (UTC)
This article is about, among other things, the mechanism that Thompson developed to code character values with a variable number of bytes, in such a manner as to make it easy to synchronize to the beginning of the codes. Multi-byte encodings had been tried before and were all found wanting. As Pike said, he and Thompson saw the opportunity to "use our experience" to develop a really good encoding mechanism. What we might call their "geeky and speculative" bent (otherwise known as their mathematical approach) allowed them to come up with a mechanism that not only solved the problem brilliantly for the UCS space -- something hordes of narrowly-focused programmers had been unable to do for years -- but also solved the problem in a fashion that is easily and consistently extensible to any arbitrarily large space as well. The genius and utter simplicity of Thompson's mechanism represents a fundamental break with previous practice -- it blows all the other schemes out of the water -- and the power of this approach is not fully conveyed without noting its extensibility. -- Elphion (talk) 14:04, 10 December 2010 (UTC)
Yes, the fundamental idea was somewhat nifty (though not completely different from Huffman coding, which had been known for about 40 years at that point), but its theoretical extendability beyond 6-byte sequences was blocked by a number of factors in the concrete case of UTF-8 -- see (1), (2), (3) listed above at "00:17, 10 December 2010" -- and I'm not sure exactly why such theoretical extendability deserves much detailed discussion or concrete exemplification (beyond a basic mention) in the UTF-8 article... AnonMoos (talk) 18:45, 10 December 2010 (UTC)
According to Rob Pike, the inspiration for the prefix code idea actually came from IP address classes, where a Class A address had a prefix of 0 in its first byte, a Class B address had a prefix of 10, &c., the implementation of which they had been working on for their Plan 9 system. --Wtrmute (talk) 03:22, 28 August 2011 (UTC)

In a variable length coding system, each code must convey two things: (A) how long the code is, and (B) what the coded value is. Most such schemes rely to some degree on implicit information: certain ranges are handled in special ways, certain values have to be computed from special values. UTF-16 is a good example: words that fall in the range D800..DBFF signify the start of two-word codes, whose value is computed from the bit pattern of the code using special constants. The various CJK multibyte schemes have similar properties.

Thompson's scheme for values > 127 is the first variable-length character encoding (that I know of) where both the length (A) and the value (B) are stored explicitly in the code itself: the byte count (A) is coded by a string of 1's, followed by a 0, followed by the numerical value (B) of the code. These are packed into the code bytes with the string of 1's starting in the high bit of the first byte, and the data continuing into the free bits of continuation bytes (whose first two bits are reserved and set to "10" to mark them as continuation bytes). Thus data items (A) and (B) are stored continuously in the data bits of the code bytes (avoiding the continuation markers), and (B) is padded with 0's at the high end so that its LSBit lands in the LSBit of the last byte of the code.

That's a simple description of Thompson's scheme. Although it was designed for a space that requires a maximum of 6 bytes, it can represent arbitrarily high values. You can stop at 6 bytes if you're interested in representing the UCS space; you can stop at 3 bytes if you're interested in representing BMP; you can stop at 4 bytes if you're interested in the standard Unicode space. But the system itself is completely general. To quote the passage (not mine) that you chose to suppress: it is "sufficiently general to be extended indefinitely to any number of bytes and an unlimited number of bits".

Obviously this is not true of standard UTF-8. The standards committee needed only a 4-byte-max subset to represent their restricted character space, whose size was already determined. The extensibility of Thompson's scheme is not "blocked" by the standard; it's simply not needed by the standard. Since they stop at 4 bytes, their encoding does not use FE or FF and contains the length of the code within the lead byte: nice but inessential additional properties. They don't negate the general extensibility of Thompson's scheme to larger spaces.

(And I confess ignorance: I see no similarity between Thompson and Huffman -- they are "codings" in completely different senses of the word.)

-- Elphion (talk) 01:39, 11 December 2010 (UTC)
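
To make the packing concrete, here is a minimal sketch of the generalized scheme as described above. It is not standard UTF-8 (which stops at four bytes and U+10FFFF); the five-, six- and seven-byte cases correspond to the original 31-bit scheme and the hypothetical extension under discussion, shown only to illustrate how the same rule keeps working:

    def prefix_encode(value):
        # Pack a non-negative integer: the lead byte carries n 1-bits, a 0,
        # then the high-order value bits; each continuation byte is 10xxxxxx.
        if value < 0x80:
            return bytes([value])                  # one byte, high bit 0
        for n in range(2, 8):                      # n-byte sequences, lead bytes C0..FE
            if value < 1 << (5 * n + 1):           # capacity: (7-n) + 6*(n-1) bits
                lead = ((0xFF << (8 - n)) & 0xFF) | (value >> (6 * (n - 1)))
                tail = [0x80 | ((value >> (6 * i)) & 0x3F) for i in range(n - 2, -1, -1)]
                return bytes([lead] + tail)
        raise ValueError("too large for a 7-byte sequence in this sketch")

prefix_encode(0x24B62) gives F0 A4 AD A2 (ordinary four-byte UTF-8), prefix_encode(0x1FFFFF) gives F7 BF BF BF, and prefix_encode(0x7FFFFFFF) gives FD BF BF BF BF BF, the top of the original 31-bit space; nothing in the rule itself forces it to stop there.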

I do like the inclusion of methods to encode more than 21 bits, since they are not part of UTF-8, and especially I do not like inclusion of methods to encode more than 31 bits. There is no reference that this scheme was suggested, and that would mean the article contains speculative original research, a private suggestion to enhance UTF-8, which Wikipedia should not contain. --BIL (talk) 06:40, 11 December 2010 (UTC)
BIL -- assume your first "do like" was intended to be a "do not like"? The five-byte and six-byte sequences are not really the problem here, since they were part of the original UTF-8 specification in the 1990s (and were only later defined out). But I agree with you that it's very dubious whether seven-byte or longer sequences should be included on this article... AnonMoos (talk) 07:29, 11 December 2010 (UTC)
I meant "do not". --BIL (talk) 10:28, 11 December 2010 (UTC)
Elphion -- the details of the Huffman algorithm are of course not relevant to UTF-8, but Huffman coding and UTF-8 are both examples of variable-length binary prefix codes, something which was well-understood in information science long before the 1990s (Huffman coding probably being the most prominent example). If you want to establish an article on the pure abstract Prosser-Thompson prefix coding scheme (distinct from its actual use in UTF-8), then that would be the place to go into great detail about extensions beyond six-byte sequences -- but not really on this article... AnonMoos (talk) 11:25, 11 December 2010 (UTC)
I'm amenable to splitting the article. I'm even willing to do most of the work, though maybe not immediately ;-). I understand your reluctance to plumb the details of Thompson's scheme in this article, since UTF-8 uses only a subset. But I do feel the scheme needs to be presented so that it is clear that Thompson's main contribution is the consistent expandability to various-sized character spaces in a manner that makes synchronization easy; and that would be appropriate in the new article.
If all you mean by "similar to Huffman" is that it's variable-length and prefix-free, well, yes: any practical character coding is prefix-free (otherwise the character boundaries are too hard to discover), and there have been variable-length character encodings for a long time. No one is claiming that Thompson invented either concept. His contribution is the explicit coding of both length and value in the code -- a clear break from previous variable character encodings.
-- Elphion (talk) 17:46, 11 December 2010 (UTC)
Didn't notice it until now, but the whole FE/FF thing was already in the article before you made your first edit -- see http://en.wikipedia.org/w/index.php?title=UTF-8&oldid=400753213#Compared_to_single-byte_encodings
Anyway, I greatly condensed the whole beyond-6-bytes thing, in accordance with discussions. AnonMoos (talk) 13:43, 13 December 2010 (UTC)

Trivia

By the way, ISO-8859-1 characters 0x80 to 0xBF (C1 controls and upper punctuation) encode to a UTF-8 sequence with a 0xC2 byte (Â) followed by themselves... AnonMoos (talk) 18:35, 27 December 2010 (UTC)

That's not strictly true; the 0xC2 is there, but does not represent a “Â” character. You have to be ever so careful not to conflate characters and bytes in this area, or you end up terribly confused. (A great many programmers seem to be part of the confused…) –Donal Fellows (talk) 09:39, 2 August 2011 (UTC)
I didn't say that they encode to a sequence of a C2 "character" followed by a byte identical to their ISO-8859-1 representation, I said they encode to a sequence of a C2 byte followed by a byte identical to their ISO-8859-1 representation... AnonMoos (talk) 15:28, 2 August 2011 (UTC)
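
A quick sketch of that byte-level fact, since it is easy to misread (Python's "latin-1" codec is ISO-8859-1):

    for b in range(0x80, 0xC0):
        assert bytes([b]).decode("latin-1").encode("utf-8") == bytes([0xC2, b])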

Codepoints

I removed some detail about what UTF-8 "can" represent. Our article on Code point calls the entire range 0..0x10FFFF of the Unicode space the Unicode "codepoints" even though some of those values (despite being representable as well-formed UTF-8 sequences) are not valid characters. It's not just the surrogates; there are unassigned characters, and there are other permanently reserved characters, like FFFE and FFFF in each plane. Once you start saying what "can't" be represented you need further fussy language about various exclusions. Let's leave it at "UTF-8 can represent all the codepoints, even though not all of those are legal characters." (This point is already made in the footnote to the first sentence of the article's second paragraph.) And in fact, many applications do use UTF-8 to represent non-legal characters. -- Elphion (talk) 22:40, 30 April 2011 (UTC)

Second sentence

Second sentence says:

Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set.
  • Well, an unsigned 16-bit integer can only hold 65,536 different values, and there are over 109,000 characters in Unicode, so no... UTF-16 cannot represent every Unicode character.
  • Alternatively, UTF-32 can hold the entire Unicode set because an unsigned 32-bit integer can hold 4,294,967,296 different values.

Hhh3h (talk) 19:47, 13 May 2011 (UTC)

If you make the effort to read the UTF-16 article, you will see that UTF-16 has a mechanism (surrogate pairs) to represent all of the approximately 1.1 million possible characters in Unicode. --BIL (talk) 19:52, 13 May 2011 (UTC)
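
For the record, here is a quick sketch of the surrogate-pair mechanism that makes this possible, using the same supplementary-plane character discussed elsewhere on this page:

    cp = 0x24B62                       # a code point outside the BMP
    v = cp - 0x10000                   # 20 bits to distribute over two 16-bit units
    high = 0xD800 | (v >> 10)          # lead (high) surrogate
    low = 0xDC00 | (v & 0x3FF)         # trail (low) surrogate
    print(hex(high), hex(low))         # 0xd852 0xdf62
    assert "\U00024B62".encode("utf-16-be") == bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])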

Overlong: is it really wrong?

If four bytes are used to encode, say, ASCII NUL, what's wrong with that, other than wasting space? Is there some document which says this is not allowed and should be diagnosed? Why? Are these encodings reserved for future extension, and is such an extension really a good idea? (E.g. using a three-byte encoding of NUL to signal something else?) I think this just complicates decoders, because their state machine has to remember what length of code it is decoding. "oops, we got a zero, but we were decoding three bytes!". This is kind of like saying that 00 or 0.00 is no longer a valid way of writing zero. 24.85.131.247 (talk) 20:34, 28 January 2012 (UTC)

See RFC 3629, section 10. But there exists an encoding named "modified UTF-8" which differs in exactly this respect: it encodes NUL with the overlong two-byte sequence C0 80. Incnis Mrsi (talk) 21:06, 28 January 2012 (UTC)
Yes, it is wrong. Allowing overlong encodings means there is more than one way to encode a character. This breaks searching, and therefore leads to security vulnerabilities. And decoding is not made harder at all, because all the restriction does is limit the second byte to a smaller range, which takes no more time to check than making sure it is 80-BF (a sketch of these checks appears just below):
  • For 2-byte, the first byte cannot be C0 or C1
  • For 3-byte, if the first byte is E0, the second byte must be >=A0
  • For 4-byte, if the first byte is F0, the second byte must be >=90

Spitzak (talk) 00:20, 29 January 2012 (UTC)
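
A minimal sketch of those second-byte checks (the function name is mine, not from any particular library):

    def starts_overlong(b0, b1):
        # True if the first two bytes of a multi-byte sequence signal an
        # overlong encoding, per the three rules listed above.
        if b0 in (0xC0, 0xC1):             # 2-byte forms of U+0000..U+007F
            return True
        if b0 == 0xE0 and b1 < 0xA0:       # 3-byte forms of U+0000..U+07FF
            return True
        if b0 == 0xF0 and b1 < 0x90:       # 4-byte forms of U+0000..U+FFFF
            return True
        return False

For example, starts_overlong(0xC0, 0xAF) is True: C0 AF is the overlong form of the slash that comes up again further down this page.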

I came up with a way that was simpler to retrofit in the particular implementation. When the first byte is processed, we know the minimum value of the character that must be retrieved from the remaining bytes. At that point, we can record that minimum value as part of the decoder state. When retrieving the character is finished, we can compare it to the minimum value, which will be 0x80, 0x800 or 0x10000. 192.139.122.42 (talk) 00:57, 3 February 2012 (UTC)
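
The same idea expressed as code (a sketch of the retrofit described above, not a complete decoder):

    MIN_VALUE = {2: 0x80, 3: 0x800, 4: 0x10000}    # smallest code point per sequence length

    def is_overlong(seq_length, code_point):
        # After the continuation bytes have been consumed, the sequence is
        # overlong if the decoded value is below the minimum for its length.
        return code_point < MIN_VALUE[seq_length]
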
24.85.131.247 -- Allowing multiple legitimate encodings would break the strict 1-1 mapping relationship between valid UTF-8 and Unicode code points, and has been used as a basis for security exploits. The overlong and beyond-Plane-16 forms can basically be avoided by simply rejecting all C0, C1, and F5-FF bytes; rejecting 80-9F bytes which directly follow an E0 byte; rejecting 80-8F bytes which directly follow an F0 byte; and rejecting 90-BF bytes following an F4 byte (in addition to rejecting incomplete or malformed sequences, of course) -- which doesn't seem all that excessively complex to me... AnonMoos (talk) 11:56, 29 January 2012 (UTC)
Sorry, folks, thanks for your responses, but I still do not understand how merely decoding multiple variants leads to security vulnerabilities (such that the fault lies with the robust decoding and not elsewhere). If this is so, should we not be concerned about every situation in which a semantic datum can be written in more than one way? For instance, in some notations, 0xFF and 255 for instance mean the same thing. How would you feel about a programming language in which array[255] was diagnosed as an array overrun (because the array is only 255 elements wide), but array[0xFF] went undetected? In some string notations we can write something like "\x41" instead of "A". Are these multiple representations prima facie security vulnerabilities? Wouldn't the bug be in the software which is naively processing the encoded representation of the data without considering the representational cases? That is to say, suppose that the datum "A" is some security sensitive command or operation code that should be rejected in the absence of sufficient privilege. Now suppose that the attacker can defeat this check by using "\x41" instead. Where is the software defect? Is it because the decoder processed the "\x41" variant into a datum indistinguishable from "A"? Or is it really because the security check was implemented over the printed notation of the data, and failed to take into account all the variants? Or is it because one algorithm is used for validating, and another one for processing, and they trust each other? I can see that if a naive algorithm is used for validating UTF-8 strings, which are then trusted and decoded after naive validations are applied, there is a problem. The solution is: don't perform security checks on raw UTF-8. Ever. Always decode it. When UTF-8 data enters into your system, decode it immediately as the first processing step, and then deal with abstract characters. Why should I bother rejecting overlong codes in a program which already works this way? E.g. which decodes UTF-8 as it is reading it from a stream, and then applies validation? if someone wants to burn three bytes encoding NUL, that's fine and my program should be robust enough to recognize this character for what it is, just like 00000.0E+0 is recognized as 0.0 (and converted to a floating point datum before being compared to anything!)192.139.122.42 (talk) 01:50, 2 February 2012 (UTC)
If a UTF-8 decoder were being implemented in hardware using ca. 1975 technology, then your proposal would save a few NAND gates or whatever, but from most other points of view it would appear to create more problems than it would solve. You seem to assume that UTF-8 is only used as a transfer encoding, and that all programs change UTF-8 to some other representation before doing anything with Unicode data, but that's really not at all the case... AnonMoos (talk) 05:15, 2 February 2012 (UTC)
"Sorry, folks, thanks for your responses, but I still do not understand how merely decoding multiple variants leads to security vulnerabilities"
"If this is so, should we not be concerned about every situation in which a semantic datum can be written in more than one way?"
Absolutely you should be concerned. You can mitigate those concerns but the mitigations come at a cost. So if you are designing a new format it is often better to say "there is exactly one correct encoding of each symbol". If you are not designing your own format you should respect the decisions made by those who did design it.
The key is ONCE DESIGN DECISIONS LIKE THIS ARE MADE THEY MUST BE RESPECTED. Not respecting them leads to security issues when different people's software is combined to form a larger system.
"The solution is: don't perform security checks on raw UTF-8. Ever. Always decode it."
That would have been one solution. However it would have DEFEATED THE OBJECT OF UTF-8, which was to provide a Unicode encoding that could be safely used in programs that assumed an "extended ASCII" encoding but did not care about the meaning of byte values beyond 128. A key aspect of making it safe for such programs to work with was ensuring that each Unicode code point (and especially code points from ASCII) had EXACTLY ONE valid encoding.
If a program truly does not care about the meaning, then that is fine. Isn't it a problem that sometimes such programs do make assertions about the meaning of those characters? That appears to be the crux of the issue: programs that know nothing about what 0x80-0xFF stand for are putting some kind of stamp of approval on such cases as being valid input, and then this assurance is trusted in other programs. I think that programs which understand only ASCII can in fact be used to validate data, if they stick to positive pattern matches. I.e. look for a specific space of valid inputs and reject all else (as opposed to looking for a space of bad inputs and accepting all else). 192.139.122.42 (talk) 22:12, 2 February 2012 (UTC)
"Why should I bother rejecting overlong codes in a program which already works this way?"
Your program may be used in combination with encoding-agnostic programs. By ignoring the design decisions made by the designers of UTF-8 you are opening up potential security vulnerabilities when your software is combined with such software.
If it is being combined by me, then the overall combination is effectively a system that I designed and I must take the responsibility. I am not responsible for careless software combinations made by others, surely? 192.139.122.42 (talk) 22:12, 2 February 2012 (UTC)
-- Plugwash (talk) 15:51, 2 February 2012 (UTC)
The issue for me isn't complexity or performance, just basic robustness. If a user has data in which some character is encoded in a funny way, I want the program to gracefully handle the user's data instead of annoying that user by throwing an exception. How about having a configuration option to support relaxed UTF-8 or not? Users don't care about why something doesn't work; they just tell themselves "well, this program is a piece of crap which chokes on my data" and move on to another solution. 192.139.122.42 (talk) 22:12, 2 February 2012 (UTC)
I very much agree that UTF-8 implementations should not throw exceptions on invalid UTF-8. Instead they should translate them into some safe value and continue. But "safe value" means something that there is no possibility of interpreting in any manner that will confuse things. If you know you are going to just draw the string, I recommend translating the first byte of an error into the matching CP1252 character and then continuing the UTF-8 decoding with the next byte, as this is most likely to produce a legible result when non-UTF-8 has been inserted into the data. For any other purpose, translating to a special value that cannot be confused with any Unicode code point is a requirement (for instance a number greater than 0x10FFFF). However neither of these is what you are asking for. I do not want the errors turning into anything that might in any way be useful. Spitzak (talk) 03:39, 3 February 2012 (UTC)
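
A sketch of the second option, mapping each offending byte to a value above U+10FFFF so it can never collide with a real code point (this is only an illustration of the idea, not any particular library's behaviour; Python's strict built-in decoder is used here just to detect the invalid sequences):

    def decode_lenient(data):
        # Decode UTF-8; each byte of an invalid sequence becomes 0x110000 + byte,
        # which is outside the Unicode range and so cannot be mistaken for a character.
        out, i = [], 0
        while i < len(data):
            for n in (1, 2, 3, 4):                     # try the possible sequence lengths
                try:
                    out.append(ord(data[i:i + n].decode("utf-8")))
                    i += n
                    break
                except UnicodeDecodeError:
                    continue
            else:
                out.append(0x110000 + data[i])         # error marker, > U+10FFFF
                i += 1
        return out

For example, decode_lenient(b"A\xc2\xa9\xc0\xaf") gives [0x41, 0xA9, 0x1100C0, 0x1100AF]: the overlong pair is reported byte by byte rather than ever becoming a slash.
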
Is there a way in which these strict UTF-8 checks are more than just expressing a form of distrust of programs that are in a position of trust? Is there some concrete example in which the program which is decoding UTF-8 data allows these extra forms and is inherently at fault for creating the security hole (without consideration of other broken software, added by others, that provides security assurances over data that it does not properly understand)? 192.139.122.42 (talk) 22:12, 2 February 2012 (UTC)
I don't quite see how this discussion is aimed at improving this article, but returning to the original question, "Is there some document which says this is not allowed and should be diagnosed?", then I suggest taking a read of the appropriate sections of the Unicode standard, in particular Definition D92 which states that non-shortest form UTF-8 byte sequences are ill-formed, and Conformance Requirement C10 which states that processes "shall treat ill-formed [UTF-8] code unit sequences as an error condition and shall not interpret such sequences as characters". BabelStone (talk) 23:06, 2 February 2012 (UTC)
Re: improving the article Arg, soooo sorry about this, darn! *redface* 192.139.122.42 (talk) 00:57, 3 February 2012 (UTC)
Just one more thing. Thanks for all the discussion, people. I've done some thinking and answered a lot of these questions for myself. Here is one way in which being careless with UTF-8 can cause problems. For instance, suppose we have file names in a POSIX-like directory. The filesystem treats names as null-terminated byte strings. Someone could craft a file name whose binary image contains an invalid code for, say, the ASCII / (slash) character. Of course a normal slash is a path separator and so could not occur in a directory entry. So our naive UTF-8 program decodes the file name, and now it has a string with slashes in it: a multi-component path name. Later, our program uses this path to issue an open() system call and read the file. A privileged program could be, for instance, fooled into divulging a security sensitive file, or even overwriting it. This kind of thing could evade the program because it's just not expected that when you read a directory entry, it might contain slashes. You don't want to be checking for that. One correct principle here is transparency. Interpret the bytes to have some encoding, if you will, but recover the exact same bytes when sending them back into the environment from which they came, or else throw exceptions on invalid inputs. This problem can happen in other ways. For instance suppose you map invalid bytes to U+DCxx and restore these on the reverse conversion. If your UTF-8 decoder also accepts a multi-byte-encoded U+DCxx, then your program is easily fooled into reproducing whatever bytes the attacker wants. DC2F is fed to the program, and it produces a 2F slash. Oops! But of course transparency is not the only point: it's also important that the "fake slash" is not seen as a slash in the program because it will screw up the path handling. Anyway, I think I sorted out all these issues in the software at hand. Thanks. 24.85.131.247 (talk) 09:57, 3 February 2012 (UTC)
U+DC2F cannot be produced as an error byte. Error bytes in UTF-8 have to have the high bit set (as any byte without the high bit set is a correctly-encoded ASCII character). Therefore any back-translator would not turn DC2F into 2F.
You are still making the incorrect assumption that somehow the UTF-8 must be "decoded" before it is used to find files. The whole point of UTF-8 is that it can be used straight by code designed to work with sequences of bytes. The Unix file API will only recognize the byte 0x2F as a slash and handle it specially. Thus it will not see the overlong encoding as a slash. For this reason, no other code handling the UTF-8 should interpret the overlong encoding as a slash either. Spitzak (talk) 21:01, 7 May 2012 (UTC)
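
To make the slash example concrete: C0 AF is the classic overlong two-byte form of '/' (U+002F), and a strict decoder refuses to turn it into anything (a quick sketch; Python's built-in UTF-8 decoder happens to be strict about this):

    overlong_slash = bytes([0xC0, 0xAF])     # overlong encoding of U+002F '/'
    try:
        overlong_slash.decode("utf-8")       # a strict decoder rejects this outright
    except UnicodeDecodeError as e:
        print("rejected:", e.reason)
    # A decoder that accepted overlong forms would hand back '/', re-introducing a
    # path separator that the byte-level checks in the filesystem never saw.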